A Long-Term Archival Pipeline for the Forschungsdatenplattform Stadt.Geschichte.Basel

omeka2dsp

Moritz Mähr

University of Basel

University of Bern

Moritz Twente

University of Basel

2025-10-15

Stadt.Geschichte.Basel

  • Large-scale historical research project, initiated in 2011 by the Association for Basel History and carried out 2017–2026 at the University of Basel
  • More than 70 researchers studying the history of Basel from the earliest settlements to the present day
  • Funded with more than 9 million Swiss francs by the Canton of Basel-Stadt, the Lottery Fund, and private sponsors
  • Specialized team for research data management and public history
  • Various research outputs, including books, papers, data stories, figures, and source code

Research Data

flowchart LR
  subgraph Research[Research]
    Publications[📚 Publications]
    Data[📊 Statistical & Geo Data]
    Code[💻 Source Code]
  end

Collecting and managing research data

flowchart LR
  subgraph Research[Research]
    Publications[📚 Publications]
    Data[📊 Statistical & Geo Data]
    Code[💻 Source Code]
  end

  subgraph Repositories
    Omeka[(📁 omeka.unibe.ch)]
    GitHub[(🐙 GitHub)]
  end

  %% Flows
  Publications -- figures --> Omeka
  Data -- visualizations --> Omeka
  Code --> GitHub

Public history with research data

flowchart LR
  subgraph Research[Research]
    Publications[📚 Publications]
    Data[📊 Statistical & Geo Data]
    Code[💻 Source Code]
  end

  subgraph Repositories[Repositories]
    Omeka[(📁 omeka.unibe.ch)]
    GitHub[(🐙 GitHub)]
  end

  subgraph PublicWeb[Public History Websites]
    RDP[forschung.stadtgeschichtebasel.ch]
  end

  %% Flows
  Publications -- figures --> Omeka
  Data -- visualizations --> Omeka
  Code --> GitHub
  Omeka -- API --> RDP
  GitHub -- static site generator --> RDP

Archiving research data for the long term

flowchart LR
  subgraph Research[Research]
    Publications[📚 Publications]
    Data[📊 Statistical & Geo Data]
    Code[💻 Source Code]
  end

  subgraph Repositories[Repositories]
    Omeka[(📁 omeka.unibe.ch)]
    GitHub[(🐙 GitHub)]
  end

  subgraph PublicWeb[Public History Websites]
    RDP[Research Data Platform<br>forschung.stadtgeschichtebasel.ch]
  end

  subgraph Archives[Long-term Archives]
    Zenodo[(📦 Zenodo)]
    DaSCH[(🏛️ DaSCH)]
    UBBasel[(📚 University Library Basel)]
  end

  %% Flows
  Publications -- figures --> Omeka
  Publications -- books --> UBBasel
  Publications -- other publications --> Zenodo
  Data -- visualizations --> Omeka
  Omeka -- API --> RDP
  GitHub -- static site generator --> RDP
  Code --> GitHub
  GitHub --> Zenodo
  Omeka --> DaSCH

Long-term preservation

  • Institutional sustainability: External dependencies such as the Omeka instance hosted at the University of Bern may not be permanently funded
  • FAIR principles: Research data must be Findable, Accessible, Interoperable, and Reusable with proper metadata
  • Citability: Researchers need persistent identifiers (PIDs) such as stable URLs or DOIs to reference historical materials
  • Scalability: Current minimal computing approach (GitHub Pages) cannot handle large files

Where DaSCH fits in: omeka2dsp

flowchart LR
  subgraph Research[Research]
    Publications[📚 Publications]
    Data[📊 Statistical & Geo Data]
  end

  subgraph Repositories
    Omeka[(📁 omeka.unibe.ch)]
  end

  subgraph Archives[Long-term Archives]
    DaSCH[(🏛️ DaSCH)]
  end

  %% Flows
  Publications -- figures --> Omeka
  Data -- visualizations --> Omeka
  Omeka --> DaSCH

Challenges

  • Data model differences (Omeka vs DaSCH)
  • Metadata transformation and crosswalks
  • Automation of deposit and version control

Data model differences

Omeka

Omeka data model (simplified 😇)

classDiagram
  %% Core Omeka S entities (reduced)
  class Item {
    o:id : int
    o:is_public : bool
    o:title : string
    dcterms:identifier : string
    dcterms:subject[*] : IconclassTerm
    dcterms:temporal : Era
    dcterms:language : ISO639
    o:created : datetime
    o:modified : datetime
  }

  class Media {
    o:id : int
    o:item_id : int
    o:ingester : string
    o:media_type : MIME
    o:original_url : uri
    o:sha256 : hash
    dcterms:creator[*] : uri|text
    dcterms:date : string~EDTF
    dcterms:license : LicenseURI
    dcterms:rights : text
  }

  class ItemSet {
    o:id : int
    o:label : string
  }

  %% Controlled vocabularies as types
  class Era { <<type>> }
  class MIME { <<type>> }
  class LicenseURI { <<type>> }
  class IconclassTerm {
    <<external scheme>>
    code : string
    label : string
  }
  class ISO639 {
    <<code>>
    value : string
  }

  %% Relations and cardinalities
  Item "1" o-- "0..*" Media : has media
  Media "*" --> "1" Item : belongs to
  Item "*" o-- "0..*" ItemSet : in set(s)
  Item --> "0..*" IconclassTerm : subjects
  Media --> "0..*" IconclassTerm : subjects
  Item --> "1" Era : temporal
  Media --> "1" Era : temporal
  Media --> MIME : media_type
  Media --> LicenseURI : license
  Item --> ISO639 : language
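In practice, these entities are harvested from the Omeka S REST API as JSON-LD. A minimal sketch of building and issuing such a request (the instance URL is a placeholder; `key_identity` and `key_credential` are Omeka S's standard key-pair query parameters):

```python
import json
import urllib.parse
import urllib.request


def items_url(base: str, key_identity: str, key_credential: str, page: int = 1) -> str:
    """Build the Omeka S /api/items URL with key-based authentication."""
    params = urllib.parse.urlencode({
        "key_identity": key_identity,
        "key_credential": key_credential,
        "page": page,
    })
    return f"{base}/api/items?{params}"


def fetch_items(base: str, key_identity: str, key_credential: str, page: int = 1) -> list:
    """Fetch one page of items as parsed JSON-LD dictionaries."""
    with urllib.request.urlopen(items_url(base, key_identity, key_credential, page)) as resp:
        return json.load(resp)
```

Pagination continues page by page until the API returns an empty list.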

DaSCH data model (simplified 😅)

classDiagram
class Parent
class Document
class ResourceWithoutMedia
class Image

class SubjectList
class LanguageList
class TypeList
class FormatList
class TemporalList
class LicenseList

%% Core links to Parent object
Document "0..1" --> "1" Parent : linkToParentObject
ResourceWithoutMedia "0..1" --> "1" Parent : linkToParentObject
Image "0..1" --> "1" Parent : linkToParentObject

%% Value-list relations
Parent "1" --> "1" TemporalList : hasTemporalList
Parent "1" --> "0..*" SubjectList : hasSubjectList
Parent "1" --> "0..1" LanguageList : hasLanguageList

Document "1" --> "0..1" SubjectList : hasSubjectList
Document "1" --> "0..1" TemporalList : hasTemporalList
Document "1" --> "0..1" TypeList : hasTypeList
Document "1" --> "0..1" FormatList : hasFormatList
Document "1" --> "0..1" LanguageList : hasLanguageList
Document "1" --> "0..1" LicenseList : hasLicenseList

ResourceWithoutMedia "1" --> "0..1" SubjectList : hasSubjectList
ResourceWithoutMedia "1" --> "0..1" TemporalList : hasTemporalList
ResourceWithoutMedia "1" --> "0..1" TypeList : hasTypeList
ResourceWithoutMedia "1" --> "0..1" FormatList : hasFormatList
ResourceWithoutMedia "1" --> "0..1" LanguageList : hasLanguageList
ResourceWithoutMedia "1" --> "0..1" LicenseList : hasLicenseList

Image "1" --> "0..1" SubjectList : hasSubjectList
Image "1" --> "0..1" TemporalList : hasTemporalList
Image "1" --> "0..1" TypeList : hasTypeList
Image "1" --> "0..1" FormatList : hasFormatList
Image "1" --> "0..1" LanguageList : hasLanguageList
Image "1" --> "0..1" LicenseList : hasLicenseList

Key differences

  • Modeling approach:

    • DaSCH: Class hierarchy (Resource with subclasses Document / Image …), explicit value classes (TextValue, ListValue).
    • Omeka S: Flat JSON-LD model (Item, Media, ItemSet), Dublin Core–centric.
  • Normalization & constraints:

    • DaSCH: Strict cardinalities and mandatory fields (hasTitle [1]).
    • Omeka S: More flexible, “validation” through templates.
  • Hierarchy representation:

    • DaSCH: Explicit Parent class and linkToParentObject.
    • Omeka S: Models relations through ItemSet membership and attached Media.

Metadata transformation

Data validation

  • Custom Python scripts use pydantic for schema validation, since Omeka does not enforce strict validation
  • Validation also surfaces data quality issues
  • Substantial manual cleaning was required, all carried out directly in Omeka
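A minimal sketch of what such a pydantic validation step can look like (the field names follow the simplified Omeka model above and are illustrative, not the project's actual schema):

```python
from typing import Optional

from pydantic import BaseModel, ValidationError


class ItemRecord(BaseModel):
    """Minimal validation model for a harvested Omeka item."""
    identifier: str                  # dcterms:identifier, needed for stable references
    title: str                       # dcterms:title, required (DaSCH cardinality 1)
    subjects: list[str] = []         # dcterms:subject, Iconclass codes
    temporal: Optional[str] = None   # dcterms:temporal, era label
    license: Optional[str] = None    # dcterms:license, URI


def validate_items(raw_items: list[dict]) -> tuple[list[ItemRecord], list[dict]]:
    """Split harvested records into valid models and rejects for manual cleaning."""
    valid, rejected = [], []
    for raw in raw_items:
        try:
            valid.append(ItemRecord(**raw))
        except ValidationError:
            rejected.append(raw)
    return valid, rejected
```

The rejected records point directly at the items that need cleaning back in Omeka.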

Crosswalk Examples

  Omeka Property       DaSCH Property       Notes
  dcterms:title        hasTitle             Required in DaSCH (cardinality 1)
  dcterms:identifier   hasIdentifier        Not minted by DSP; used for stable references
  dcterms:subject      hasSubjectList       Iconclass codes preserved
  dcterms:temporal     hasTemporalList      Era mapping required
  dcterms:creator      hasCreatorList       Multiple creators supported
  dcterms:license      hasLicenseList       License URIs validated
  Media → Item         linkToParentObject   Hierarchy explicitly modeled
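In code, the crosswalk above reduces to a lookup table plus a rename step; a sketch (the helper name and the drop-unmapped behaviour are illustrative choices, not the project's exact logic):

```python
# Dublin Core -> DaSCH property crosswalk, taken from the table above.
CROSSWALK = {
    "dcterms:title": "hasTitle",
    "dcterms:identifier": "hasIdentifier",
    "dcterms:subject": "hasSubjectList",
    "dcterms:temporal": "hasTemporalList",
    "dcterms:creator": "hasCreatorList",
    "dcterms:license": "hasLicenseList",
}


def crosswalk_item(omeka_props: dict) -> dict:
    """Rename mapped Dublin Core properties; drop everything unmapped."""
    return {CROSSWALK[k]: v for k, v in omeka_props.items() if k in CROSSWALK}
```

Keeping the mapping in one table makes the crosswalk easy to document and review, which matters for reproducibility.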

Version control and updates

  • Challenge: DaSCH supports versioning, but requires careful planning
  • Strategy:
    • Initial deposit: Create new resources via REST API
    • Updates: Use PUT requests with resource IDs to create new versions
    • Identifier stability: ARK IDs remain constant across versions
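A sketch of the update path under these assumptions (the payload shape and the `/v2/values` endpoint are simplified stand-ins; the real DSP-API JSON-LD contract is richer and should be checked against its documentation):

```python
import json
import urllib.request


def build_update(resource_iri: str, prop: str, new_value: dict) -> bytes:
    """Assemble a JSON-LD update payload (illustrative shape, not the exact DSP-API contract)."""
    return json.dumps({"@id": resource_iri, prop: new_value}).encode()


def put_update(api_base: str, token: str, payload: bytes) -> None:
    """PUT the update; DSP records it as a new version of the resource."""
    req = urllib.request.Request(
        f"{api_base}/v2/values",  # assumed endpoint path; verify against the DSP-API docs
        data=payload,
        method="PUT",
        headers={
            "Content-Type": "application/ld+json",
            "Authorization": f"Bearer {token}",
        },
    )
    urllib.request.urlopen(req)
```

Because DSP versions each update internally, the client never rewrites history: the ARK keeps resolving while earlier versions stay retrievable.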

API vs DSP-Tools

REST API Approach

  • Direct HTTP requests
  • Fine-grained control
  • Supports versioning
  • Complex error handling
  • Used for updates

DSP-Tools Approach

  • XML-based bulk import
  • Good for initial deposits
  • Less flexible for updates
  • Comprehensive validation
  • Used for large-scale ingestion

Our choice: dsp-tools for project setup, the REST API for ingestion and updates

omeka2dsp

Lessons Learned

  • Metadata crosswalks and data quality: Mapping Dublin Core to the DaSCH ontology required custom logic and validation. Omeka’s flexible schema led to inconsistent metadata, demanding extensive cleaning.
  • Identifiers and file handling: Synchronizing identifiers between Omeka and DaSCH proved complex. Large files (>100 MB) necessitated chunked uploads and tailored handling.
  • Workflow timing and validation: Determining the right moment to shift from active curation to archival mode was key. Early validation with pydantic schemas and subset testing prevented costly errors.
  • Documentation and reproducibility: Precise mapping documentation ensured consistency across transformations and supported reproducible workflows.
  • Architecture and infrastructure: Decoupling archival (DaSCH) from presentation (CollectionBuilder) enhanced flexibility. Lightweight public interfaces can coexist with robust preservation systems when APIs enable automation.

Key Takeaways

For the Community

  • Lightweight publishing can work with robust infrastructure
  • FAIR principles are achievable in practice
  • Decoupled architectures provide flexibility
  • Open-source tools enable customization

For DaSCH Users

  • Plan metadata transformation early
  • Use validation extensively
  • Consider hybrid API/DSP-Tools approach
  • Document mapping decisions thoroughly

Resources